Day 25 [Python ML, Data Cleaning] Handling Missing Values

2021 iThome Ironman Contest

DAY 25

AI & Data

Learning Machine Learning with Python, part 25 of the series

13th Ironman Contest

guancioul

Team 人工逗點智慧

2021-10-08 10:48:50

Start by looking at the data

# modules we'll use
import pandas as pd
import numpy as np

# read in all our data
nfl_data = pd.read_csv("./NFL Play by Play 2009-2017 (v4).csv")

# set seed for reproducibility
np.random.seed(0)

/opt/conda/lib/python3.6/site-packages/IPython/core/interactiveshell.py:3072: DtypeWarning: Columns (25,51) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)
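The DtypeWarning above appears because pandas infers column types chunk by chunk and found columns with mixed types. The following is a minimal sketch of the two remedies the warning itself suggests. It uses a small in-memory CSV whose contents and column names are made up for illustration; they are not taken from the NFL file.

```python
import io
import pandas as pd

# Hypothetical stand-in for a CSV whose "yards" column mixes numbers and text
csv_text = "play_id,yards\n1,5\n2,unknown\n3,12\n"

# Option 1: read the file in one pass so each column gets a single inferred type
df_whole = pd.read_csv(io.StringIO(csv_text), low_memory=False)

# Option 2: state the type of the ambiguous column explicitly
df_typed = pd.read_csv(io.StringIO(csv_text), dtype={"yards": str})

print(df_whole.dtypes)
print(df_typed.dtypes)
```

For the real dataset the same keyword arguments would go into the existing pd.read_csv call; which columns need an explicit dtype depends on the file.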

Once you have the data, the first thing to check is whether it contains missing values, which show up as NaN or None.
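As a small illustration of what "missing" means to pandas, the sketch below builds a toy DataFrame (made-up values, not the NFL data) with one NaN and one None and shows that isnull() flags both.

```python
import numpy as np
import pandas as pd

# Toy data (made up): one NaN in a numeric column, one None in a text column
toy = pd.DataFrame({
    "yards": [5.0, np.nan, 12.0],
    "team": ["DEN", None, "PIT"],
})

# isnull() marks both NaN and None as missing
missing_mask = toy.isnull()
print(missing_mask)

# Number of missing cells in each column
missing_per_column = missing_mask.sum()
print(missing_per_column)
```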

# look at the first five rows of the nfl_data file.
# I can see a handful of missing data already!
nfl_data.head()

Sure enough, there are missing values.

How many missing data points are there?

Now let's see how many missing values each column has across the whole dataset.

# get the number of missing data points per column
missing_values_count = nfl_data.isnull().sum()

# look at the # of missing points in the first ten columns
missing_values_count[0:10]

Date                0
GameID              0
Drive               0
qtr                 0
down            61154
time              224
TimeUnder           0
TimeSecs          224
PlayTimeDiff      444
SideofField       528
dtype: int64

Next, let's see what percentage of all the cells in the dataset are missing.

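# how many total missing values do we have?
total_cells = np.prod(nfl_data.shape)
total_missing = missing_values_count.sum()

# percent of data that is missing
percent_missing = (total_missing/total_cells)*100
print(percent_missing)

24.87214126835169

np.prod multiplies all the values together (the older alias np.product was removed in NumPy 2.0).
nfl_data.shape returns the dimensions of the data as a tuple of (rows, columns), so their product is the total number of cells.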
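To make the arithmetic concrete, here is a minimal sketch on a toy DataFrame (values made up for illustration) with 4 rows, 2 columns, and 3 missing cells. It uses np.prod to multiply the two entries of shape.

```python
import numpy as np
import pandas as pd

# Toy data (made up): 4 rows x 2 columns, 3 missing cells
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [np.nan, np.nan, 1.0, 2.0],
})

rows, cols = toy.shape                     # shape is a (rows, columns) tuple
total_cells = np.prod(toy.shape)           # 4 * 2 = 8
total_missing = toy.isnull().sum().sum()   # per-column counts, then their total

percent_missing = total_missing / total_cells * 100
print(rows, cols, total_cells, total_missing, percent_missing)
```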

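Figure out why the data is missing

This step is about building intuition for the data: looking at it closely enough to understand why values are missing.

If you are new to this, start by asking yourself the following question: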

Is this value missing because it wasn't recorded or because it doesn't exist?

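If a value is missing because it does not exist, keep it as NaN. For example, if you ask for the height of someone's oldest child and that person has no children, there is nothing to fill in.

If a value is missing because it was not recorded, try to estimate what it should have been.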

# look at the # of missing points in the first ten columns
missing_values_count[0:10]

Date                0
GameID              0
Drive               0
qtr                 0
down            61154
time              224
TimeUnder           0
TimeSecs          224
PlayTimeDiff      444
SideofField       528
dtype: int64

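Judging from the data, these values look like they were not recorded rather than nonexistent.

So we should try to work out what the NA values ought to be.

However, one field has many missing values because it records team penalties.

Some teams really did have no penalty, so those values should stay empty.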
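One way to act on this distinction is to fill only the columns judged "not recorded" and leave the "does not exist" columns untouched. The sketch below is only an illustration under stated assumptions: the column names and values are hypothetical, and using the column median as the guess is a choice made here, not something the original analysis prescribes.

```python
import numpy as np
import pandas as pd

# Made-up columns:
#   "seconds_left" is assumed to be missing because it was not recorded
#   "penalty_team" is assumed to be missing because no penalty happened
plays = pd.DataFrame({
    "seconds_left": [900.0, np.nan, 870.0],
    "penalty_team": [np.nan, "DEN", np.nan],
})

# Fill only the "not recorded" column, here with its median as a simple guess.
# Columns that are not keys of the dict are left as they are.
fill_values = {"seconds_left": plays["seconds_left"].median()}
filled = plays.fillna(value=fill_values)
print(filled)
```

Passing a dict to fillna() keeps the decision explicit per column, so a later reader can see which gaps were treated as guessable.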

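Drop missing values

If you have no reason to work out why the values are missing, one option is simply to remove any rows or columns that contain missing values.

If you are sure you want to do this, pandas has a handy function, dropna(), that takes care of it.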

# remove all the rows that contain a missing value
nfl_data.dropna()

Here dropna() removes all of the data, because every row contains at least one missing value.

Instead, we can drop only the columns that contain missing values.

# remove all columns with at least one missing value
columns_with_na_dropped = nfl_data.dropna(axis=1)
columns_with_na_dropped.head()

# just how much data did we lose?
print("Columns in original dataset: %d \n" % nfl_data.shape[1])
print("Columns with na's dropped: %d" % columns_with_na_dropped.shape[1])

Columns in original dataset: 102 

Columns with na's dropped: 41

We lost quite a few columns, but there are no missing values left.
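The difference between dropping rows and dropping columns is easy to see on a small example. This sketch uses a made-up three-row DataFrame in which every row has a gap somewhere, mirroring the situation above.

```python
import numpy as np
import pandas as pd

# Toy data (made up): every row has at least one missing value
toy = pd.DataFrame({
    "game_id": [1, 2, 3],
    "down": [1.0, np.nan, 3.0],
    "penalty": [np.nan, "holding", np.nan],
})

rows_dropped = toy.dropna()         # drops every row that has a missing value
cols_dropped = toy.dropna(axis=1)   # drops every column that has a missing value

print(rows_dropped.shape)   # no rows survive
print(cols_dropped.shape)   # only the complete column survives
```

Neither call modifies toy itself; dropna() returns a new DataFrame unless inplace=True is passed.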

Fill in missing values automatically

First, let's take a small subset of the data.

# get a small subset of the NFL dataset
subset_nfl_data = nfl_data.loc[:, 'EPA':'Season'].head()
subset_nfl_data

We can use pandas' fillna() function to fill in missing values.

We get to choose what value replaces NaN; here we fill with 0.

# replace all NA's with 0
subset_nfl_data.fillna(0)

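We can also replace each missing value with the value that comes directly after it in the same column.

(This makes sense for datasets whose rows follow some logical order.)

# replace all NA's with the value that comes directly after it in the same column,
# then replace all the remaining NA's with 0
# (bfill() replaces fillna(method='bfill'), which is deprecated in recent pandas)
subset_nfl_data.bfill(axis=0).fillna(0)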
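To see what the backward fill does step by step, here is a minimal sketch on a single made-up column. It uses DataFrame.bfill(), which performs the same backward fill, and then fills whatever is still missing with 0.

```python
import numpy as np
import pandas as pd

# Toy column (made up) with gaps, including one at the very end
toy = pd.DataFrame({"score": [np.nan, 7.0, np.nan, np.nan, 14.0, np.nan]})

# Step 1: each NaN takes the next valid value below it in the same column
back_filled = toy.bfill(axis=0)

# Step 2: the last row has nothing after it, so it is still NaN; fill it with 0
result = back_filled.fillna(0)

print(back_filled)
print(result)
```

The trailing gap is why the extra fillna(0) is needed: a backward fill can never fill a missing value that has no later observation.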